Making predictions over the Amazon Fine Food Reviews dataset

Predictions

The purpose of this analysis is to build a prediction model that can tell whether a recommendation is positive or negative. In this analysis, we will not focus on the Score itself, but only on the positive/negative sentiment of the recommendation.

To do so, we will work on Amazon's recommendation dataset and build a term-document matrix with tf-idf (term frequency, inverse document frequency) weighting. Once the data is ready, we will feed it into predictive algorithms, mainly naïve Bayes and logistic regression.
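
As a quick reminder, the tf-idf weight of a term t in a document d, over a corpus of N documents, is

    tfidf(t, d) = tf(t, d) * idf(t),    with    idf(t) = log(N / df(t))

where tf(t, d) counts the occurrences of t in d and df(t) is the number of documents containing t. (scikit-learn's TfidfTransformer uses a smoothed variant of this idf by default.)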

In the end, we hope to find a "best" model for predicting the recommendation's sentiment.

Loading the data

In order to load the data, we will use the SQLite database, from which we will only fetch the Score and the recommendation summary.

As we only want to get the global sentiment of the recommendations (positive or negative), we will purposefully ignore all Scores equal to 3. If the score is above 3, the recommendation will be set to "positive"; otherwise, it will be set to "negative".

The data will be split into a training set and a test set, with a test-set ratio of 0.2.


In [2]:
%matplotlib inline

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import matplotlib as mpl

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

Let's first check whether we have the dataset available:


In [3]:
import os
from IPython.core.display import display, HTML
    
if not os.path.isfile('database.sqlite'):
    display(HTML("<h3 style='color: red'>Dataset database missing!</h3><h3> Please download it "+
          "<a href='https://www.kaggle.com/snap/amazon-fine-food-reviews'>from here on Kaggle</a> "+
          "and extract it to the current directory."))
    raise(Exception("missing dataset"))



In [ ]:
con = sqlite3.connect('database.sqlite')

pd.read_sql_query("SELECT * FROM Reviews LIMIT 3", con)

Let's select only what's of interest to us:


In [ ]:
messages = pd.read_sql_query("""
SELECT 
  Score, 
  Summary, 
  HelpfulnessNumerator as VotesHelpful, 
  HelpfulnessDenominator as VotesTotal
FROM Reviews 
WHERE Score != 3""", con)

Let's see what we've got:


In [ ]:
messages.head(5)

Let's add a Sentiment column that turns the numeric score into either positive or negative.

Similarly, a Usefulness column marks a review as "useful" when more than 80% of its votes are helpful, and "useless" otherwise.


In [ ]:
messages["Sentiment"] = messages["Score"].apply(lambda score: "positive" if score > 3 else "negative")
messages["Usefulness"] = (messages["VotesHelpful"]/messages["VotesTotal"]).apply(lambda n: "useful" if n > 0.8 else "useless")

messages.head(5)

Let's have a look at some 5s:


In [ ]:
messages[messages.Score == 5].head(10)

And some 1s as well:


In [ ]:
messages[messages.Score == 1].head(10)

Extracting features from text data

scikit-learn cannot work directly with words, so we'll assign a new dimension to each word and work with word counts.

See more here: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
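
As a minimal illustration (on a made-up two-sentence corpus), here is how CountVectorizer assigns a column to each word and counts its occurrences:

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, just to show the word -> column mapping
toy_corpus = ["the soup was great", "the soup was awful"]
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_corpus)  # sparse matrix: 2 documents x vocabulary size
print(toy_vect.vocabulary_)                      # word -> column index
print(toy_counts.toarray())                      # per-document word counts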


In [9]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

import re
import string
import nltk

cleanup_re = re.compile('[^a-z]+')
def cleanup(sentence):
    sentence = sentence.lower()
    sentence = cleanup_re.sub(' ', sentence).strip()
    #sentence = " ".join(nltk.word_tokenize(sentence))
    return sentence

messages["Summary_Clean"] = messages["Summary"].apply(cleanup)

train, test = train_test_split(messages, test_size=0.2)
print("%d items in training data, %d in test data" % (len(train), len(test)))


420651 items in training data, 105163 in test data

In [10]:
from wordcloud import WordCloud, STOPWORDS

# To remove stop words, pass stop_words = STOPWORDS to CountVectorizer,
# but the results seem to be better without it
count_vect = CountVectorizer(min_df = 1, ngram_range = (1, 4))
X_train_counts = count_vect.fit_transform(train["Summary_Clean"])

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

X_new_counts = count_vect.transform(test["Summary_Clean"])
X_test_tfidf = tfidf_transformer.transform(X_new_counts)

y_train = train["Sentiment"]
y_test = test["Sentiment"]

prediction = dict()

Let's get fancy with WordClouds!


In [11]:
from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)

#mpl.rcParams['figure.figsize']=(8.0,6.0)    #(6.0,4.0)
mpl.rcParams['font.size']=12                #10 
mpl.rcParams['savefig.dpi']=100             #72 
mpl.rcParams['figure.subplot.bottom']=.1 


def show_wordcloud(data, title = None):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=200,
        max_font_size=40, 
        scale=3,
        random_state=1 # chosen at random by flipping a coin; it was heads
    ).generate(str(data))
    
    fig = plt.figure(1, figsize=(8, 8))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()
    
show_wordcloud(messages["Summary_Clean"])


We can also view wordclouds for only positive or only negative entries:


In [12]:
show_wordcloud(messages[messages.Score == 1]["Summary_Clean"], title = "Low scoring")



In [13]:
show_wordcloud(messages[messages.Score == 5]["Summary_Clean"], title = "High scoring")


Create a Multinomial Naïve Bayes model


In [14]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB().fit(X_train_tfidf, y_train)
prediction['Multinomial'] = model.predict(X_test_tfidf)

Create a Bernoulli Naïve Bayes model


In [15]:
from sklearn.naive_bayes import BernoulliNB
model = BernoulliNB().fit(X_train_tfidf, y_train)
prediction['Bernoulli'] = model.predict(X_test_tfidf)

Create a Logistic Regression model


In [16]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e5)
logreg_result = logreg.fit(X_train_tfidf, y_train)
prediction['Logistic'] = logreg.predict(X_test_tfidf)

Create a Linear SVC model


In [17]:
from sklearn.svm import LinearSVC
linsvc = LinearSVC(C=1e5)
linsvc_result = linsvc.fit(X_train_tfidf, y_train)
prediction['LinearSVC'] = linsvc.predict(X_test_tfidf)

Analyzing Results

Before analyzing the results, let's recall what precision and recall are (more details: https://en.wikipedia.org/wiki/Precision_and_recall).
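
In terms of true positives (TP), false positives (FP) and false negatives (FN):

    precision = TP / (TP + FP)        recall = TP / (TP + FN)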

ROC Curves

In order to compare our learning algorithms, let's build the ROC curve. The curve with the highest AUC value will show our "best" algorithm.

In a first data-cleaning pass, stop-word removal was used, but the results were noticeably worse. A likely reason is that when people express whether something is or is not good, they use many small words such as "not", which are typically tagged as stop-words and removed. This is why, in the end, it was decided to keep the stop-words. For those who would like to try it themselves, the stop-word removal is left as a comment in the cleaning part of the analysis.


In [18]:
def formatt(x):
    if x == 'negative':
        return 0
    return 1
vfunc = np.vectorize(formatt)

cmp = 0
colors = ['b', 'g', 'y', 'm', 'k']
for model, predicted in prediction.items():
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test.map(formatt), vfunc(predicted))
    roc_auc = auc(false_positive_rate, true_positive_rate)
    plt.plot(false_positive_rate, true_positive_rate, colors[cmp], label='%s: AUC %0.2f'% (model,roc_auc))
    cmp += 1

plt.title('Classifiers comparison with ROC')
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()


After plotting the ROC curves, it would appear that the logistic regression model gives the best results, although its AUC value is not outstanding.
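
Note that the curves above are computed from hard class predictions, so each classifier really contributes a single operating point. As a minimal sketch (reusing the logreg model fitted earlier), a probability-based ROC curve for the logistic model could be drawn like this:

In [ ]:
# Sketch: ROC from predicted probabilities instead of hard labels
probs = logreg.predict_proba(X_test_tfidf)[:, 1]   # P("positive"); classes_ are sorted alphabetically
fpr, tpr, _ = roc_curve(y_test.map(formatt), probs)
plt.plot(fpr, tpr, label='Logistic (probabilities): AUC %0.2f' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()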

It looks like the best models are LogisticRegression and LinearSVC. Let's see the precision, recall and confusion matrix for these models:


In [19]:
for model_name in ["Logistic", "LinearSVC"]:
    print("Classification report for %s" % model_name)
    # y_test already holds the string labels "negative"/"positive"
    print(metrics.classification_report(y_test, prediction[model_name]))
    print()


Classification report for Logistic
             precision    recall  f1-score   support

   negative       0.88      0.84      0.86     16436
   positive       0.97      0.98      0.97     88727

avg / total       0.96      0.96      0.96    105163


Classification report for LinearSVC
             precision    recall  f1-score   support

   negative       0.83      0.85      0.84     16436
   positive       0.97      0.97      0.97     88727

avg / total       0.95      0.95      0.95    105163



In [20]:
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues, labels=["positive", "negative"]):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(labels))
    plt.xticks(tick_marks, labels, rotation=45)
    plt.yticks(tick_marks, labels)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    
# Compute confusion matrix
cm = confusion_matrix(y_test, prediction['Logistic'])
np.set_printoptions(precision=2)
plt.figure()
plot_confusion_matrix(cm)    

cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')
plt.show()


Let's also have a look at what the best and worst words are by looking at the model coefficients:


In [21]:
words = count_vect.get_feature_names()
feature_coefs = pd.DataFrame(
    data = list(zip(words, logreg_result.coef_[0])),
    columns = ['feature', 'coef'])

feature_coefs.sort_values(by='coef')


Out[21]:
feature coef
967049 worst -44.423977
983288 yuck -32.798962
820601 terrible -30.725437
422085 horrible -29.568083
61664 awful -29.549015
587861 not -29.517080
227249 disgusting -25.736642
569844 nasty -25.434613
966932 worse -23.675443
771474 stale -23.175920
55684 at best -23.045667
743490 sick -22.671042
937834 weak -22.439037
64217 bad -22.389412
393691 gross -22.329017
668716 poor -22.136753
518607 low quality -21.973189
443233 instead -21.778065
222437 didn -21.766492
804948 tasteless -21.114726
58868 avoid -20.743722
697967 rancid -20.091260
598406 not very good -20.060497
348817 good and tangy -18.965442
597882 not too good -18.902264
582723 no flavor -18.842567
983599 yuk -18.803222
669276 poorly -17.863233
546986 moldy -17.835281
440440 inedible -17.749561
... ... ...
272176 fabulous 17.914353
589293 not burnt 18.011484
97636 better than 18.186750
590191 not disappointed 18.459886
516173 loves 18.656262
593258 not like cardboard 19.036431
751726 smooth 19.144529
212288 delicious 19.151504
592145 not greasy 19.260567
285714 finally 19.447863
591578 not from china 19.477991
4758 addictive 19.521455
590828 not expired 19.645489
593574 not made in china 19.661448
278989 favorite 20.092982
412574 heaven 20.168960
348222 good 20.416811
597821 not too 20.731870
17668 amazing 20.782510
508455 love 22.401286
274917 fantastic 24.135847
59431 awesome 24.517802
598472 not very salty 24.682914
596011 not so bad 24.806706
655724 perfect 25.407171
961404 wonderful 26.585667
589143 not bitter 26.717856
588671 not bad 35.914588
368355 great 37.114205
81362 best 42.539952

988938 rows × 2 columns


In [22]:
def test_sample(model, sample):
    sample_counts = count_vect.transform([sample])
    sample_tfidf = tfidf_transformer.transform(sample_counts)
    result = model.predict(sample_tfidf)[0]
    prob = model.predict_proba(sample_tfidf)[0]
    print("Sample estimated as %s: negative prob %f, positive prob %f" % (result.upper(), prob[0], prob[1]))

test_sample(logreg, "The food was delicious, it smelled great and the taste was awesome")
test_sample(logreg, "The whole experience was horrible. The smell was so bad that it literally made me sick.")
test_sample(logreg, "The food was ok, I guess. The smell wasn't very good, but the taste was ok.")


Sample estimated as POSITIVE: negative prob 0.000921, positive prob 0.999079
Sample estimated as NEGATIVE: negative prob 0.999997, positive prob 0.000003
Sample estimated as POSITIVE: negative prob 0.245712, positive prob 0.754288

Now let's try to predict how helpful a review is


In [23]:
show_wordcloud(messages[messages.Usefulness == "useful"]["Summary_Clean"], title = "Useful")
show_wordcloud(messages[messages.Usefulness == "useless"]["Summary_Clean"], title = "Useless")


Nothing seems to pop out... Let's try limiting the dataset to entries with at least 10 votes.


In [24]:
messages_ufn = messages[messages.VotesTotal >= 10]
messages_ufn.head()


Out[24]:
Score Summary VotesHelpful VotesTotal Sentiment Usefulness Summary_Clean
32 4 Best of the Instant Oatmeals 19 19 positive useful best of the instant oatmeals
33 4 Good Instant 13 13 positive useful good instant
75 5 Forget Molecular Gastronomy - this stuff rocke... 15 15 positive useful forget molecular gastronomy this stuff rockes ...
145 5 tastes very fresh 17 19 positive useful tastes very fresh
195 1 CHANGED FORMULA MAKES CATS SICK!!!! 3 10 negative useless changed formula makes cats sick

Now let's try again with the word clouds:


In [25]:
show_wordcloud(messages_ufn[messages_ufn.Usefulness == "useful"]["Summary_Clean"], title = "Useful")
show_wordcloud(messages_ufn[messages_ufn.Usefulness == "useless"]["Summary_Clean"], title = "Useless")


This seems a bit better. Let's see if we can build a model for usefulness, though:


In [26]:
from sklearn.pipeline import Pipeline

train_ufn, test_ufn = train_test_split(messages_ufn, test_size=0.2)

ufn_pipe = Pipeline([
    ('vect', CountVectorizer(min_df = 1, ngram_range = (1, 4))),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(C=1e5)),
])

ufn_result = ufn_pipe.fit(train_ufn["Summary_Clean"], train_ufn["Usefulness"])

prediction['Logistic_Usefulness'] = ufn_pipe.predict(test_ufn["Summary_Clean"])
print(metrics.classification_report(test_ufn["Usefulness"], prediction['Logistic_Usefulness']))


             precision    recall  f1-score   support

     useful       0.84      0.88      0.86      2998
    useless       0.76      0.69      0.72      1615

avg / total       0.81      0.81      0.81      4613

Let's also see which of the reviews are rated by our model as most helpful and least helpful:


In [27]:
# a[0] is the predicted probability of "useful" (classes_ are sorted alphabetically);
# the pipeline was fitted on Summary_Clean, but raw Summary works too since CountVectorizer lowercases and tokenizes
ufn_scores = [a[0] for a in ufn_pipe.predict_proba(train_ufn["Summary"])]
ufn_scores = zip(ufn_scores, train_ufn["Summary"], train_ufn["VotesHelpful"], train_ufn["VotesTotal"])
ufn_scores = sorted(ufn_scores, key=lambda t: t[0], reverse=True)

# just make this into a DataFrame since jupyter renders it nicely:
pd.DataFrame(ufn_scores)


Out[27]:
0 1 2 3
0 1.000000e+00 best 13 13
1 9.999999e-01 Great for Baking 10 11
2 9.999999e-01 Great for baking! 10 10
3 9.999999e-01 Great for Baking 23 23
4 9.999999e-01 Great for baking 9 10
5 9.999999e-01 Great for baking 32 34
6 9.999999e-01 Great for baking 12 13
7 9.999999e-01 Great for Baking- 14 15
8 9.999999e-01 Great for baking 12 13
9 9.999998e-01 FINALLY 97 99
10 9.999998e-01 Finally! 11 12
11 9.999998e-01 FINALLY 69 74
12 9.999998e-01 Finally! 10 11
13 9.999998e-01 Finally! 20 23
14 9.999995e-01 best bread 24 24
15 9.999994e-01 Best Honey 19 22
16 9.999994e-01 Best honey 28 30
17 9.999988e-01 Best Value 15 15
18 9.999986e-01 best of the best 11 11
19 9.999979e-01 Best popcorn ever 13 14
20 9.999979e-01 Best popcorn ever! 18 18
21 9.999975e-01 I love these 21 21
22 9.999975e-01 Love these! 12 12
23 9.999975e-01 Love these!!! 13 13
24 9.999975e-01 Love these! 14 15
25 9.999972e-01 Best Gluten Free Bread Mix 33 34
26 9.999971e-01 My Dogs LOVE these 14 14
27 9.999971e-01 My Dogs LOVE these 14 14
28 9.999970e-01 A quality product at a good price, but not wha... 33 33
29 9.999967e-01 GREAT ALTERNATIVE! 10 10
... ... ... ... ...
18422 8.361379e-07 Yuck! 6 10
18423 8.361379e-07 Yuck! 4 13
18424 8.361379e-07 Yuck! 5 15
18425 8.361379e-07 YUCK !!! 8 11
18426 8.361379e-07 Yuck! 6 17
18427 8.361379e-07 yuck!! 13 17
18428 2.465316e-07 gross 5 18
18429 2.465316e-07 Gross! 7 13
18430 2.465316e-07 Gross! 7 13
18431 2.465316e-07 Gross! 7 13
18432 2.465316e-07 Gross! 7 13
18433 2.465316e-07 Gross!!!!! 4 10
18434 2.465316e-07 Gross! 7 13
18435 2.465316e-07 Gross! 4 11
18436 2.465316e-07 GROSS 12 17
18437 2.465316e-07 gross 5 18
18438 2.465316e-07 Gross! 7 13
18439 2.465316e-07 Gross! 7 13
18440 2.465316e-07 gross 5 18
18441 2.465316e-07 Gross! 8 16
18442 2.465316e-07 Gross 9 12
18443 2.465316e-07 gross 5 18
18444 2.465316e-07 Gross! 7 13
18445 2.465316e-07 Gross! 7 13
18446 2.465316e-07 gross 5 12
18447 2.465316e-07 GROSS 9 22
18448 6.339364e-08 The worst!! 6 16
18449 6.339364e-08 The worst !!! 1 16
18450 6.339364e-08 The Worst 1 11
18451 6.339364e-08 The worst!! 6 16

18452 rows × 4 columns


In [28]:
cm = confusion_matrix(test_ufn["Usefulness"], prediction['Logistic_Usefulness'])
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
np.set_printoptions(precision=2)
plt.figure()
plot_confusion_matrix(cm_normalized, labels=["useful", "useless"])


Even more complicated pipeline

This pipeline combines the tf-idf features of the summary text with the raw Score column through a FeatureUnion, giving the Score the larger weight.


In [29]:
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

# Useful to select only certain features in a dataset for forwarding through a pipeline
# See: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
    def fit(self, x, y=None):
        return self
    def transform(self, data_dict):
        return data_dict[self.key]

train_ufn2, test_ufn2 = train_test_split(messages_ufn, test_size=0.2)

ufn_pipe2 = Pipeline([
   ('union', FeatureUnion(
       transformer_list = [
           ('summary', Pipeline([
               ('textsel', ItemSelector(key='Summary_Clean')),
               ('vect', CountVectorizer(min_df = 1, ngram_range = (1, 4))),
               ('tfidf', TfidfTransformer())])),
          ('score', ItemSelector(key=['Score']))
       ],
       transformer_weights = {
           'summary': 0.2,
           'score': 0.8
       }
   )),
   ('model', LogisticRegression(C=1e5))
])

ufn_result2 = ufn_pipe2.fit(train_ufn2, train_ufn2["Usefulness"])
prediction['Logistic_Usefulness2'] = ufn_pipe2.predict(test_ufn2)
print(metrics.classification_report(test_ufn2["Usefulness"], prediction['Logistic_Usefulness2']))


             precision    recall  f1-score   support

     useful       0.87      0.89      0.88      2968
    useless       0.79      0.75      0.77      1645

avg / total       0.84      0.84      0.84      4613


In [30]:
cm = confusion_matrix(test_ufn2["Usefulness"], prediction['Logistic_Usefulness2'])
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
np.set_printoptions(precision=2)
plt.figure()
plot_confusion_matrix(cm_normalized, labels=["useful", "useless"])



In [31]:
len(ufn_result2.named_steps['model'].coef_[0])


Out[31]:
86527

Again, let's have a look at the best/worst words:


In [32]:
ufn_summary_pipe = next(tr[1] for tr in ufn_result2.named_steps["union"].transformer_list if tr[0]=='summary')
ufn_words = ufn_summary_pipe.named_steps['vect'].get_feature_names()
ufn_features = ufn_words + ["Score"]
ufn_feature_coefs = pd.DataFrame(
    data = list(zip(ufn_features, ufn_result2.named_steps['model'].coef_[0])),
    columns = ['feature', 'coef'])
ufn_feature_coefs.sort_values(by='coef')


Out[32]:
feature coef
9533 blk water -74.777643
50395 not buy this product -72.157559
36243 hot chocolate -68.105937
32185 great for -66.859561
30127 god awful -65.407671
58458 pretty good -57.912338
50659 not fresh -57.749880
2256 and -57.547117
77706 total ripoff -56.597793
61332 really really bad -56.317598
48615 nasty stuff -55.698508
9140 big surprise -55.177211
39658 issues -55.006595
46885 mold -54.870105
66780 splenda alert -54.074558
20784 don like -53.878187
7488 best -53.213790
58483 pretty nice -53.145168
6016 bad product -52.998426
61528 recipe -52.991121
5900 bad beef -52.526385
204 absolutely overpriced -51.797967
84838 wrong item shipped -51.754976
13940 changed -51.517844
20329 dog danger -50.036589
34129 hard plastic -49.770210
25579 flavoring -49.759078
53544 ok not -49.466632
53545 ok not quite -49.466632
53546 ok not quite what -49.466632
... ... ...
48771 nature candy 52.291246
44255 love this brand 52.329345
34990 healthy family 52.529122
48372 my review 52.734235
43504 little lacking 52.868384
77327 tomato food 53.129313
4573 artificial flavor 53.785953
65840 so versatile 54.168931
50750 not good product 56.192680
53337 oh yes 56.421614
80501 very tasty snack 56.615730
77618 top cat 56.941416
17500 crunchy and delicious 58.100071
44075 love blk water 58.219841
44074 love blk 58.219841
31237 good treats 59.110648
34898 healthy and delicious 59.671328
18479 definitely happy 60.045587
78763 unbelievable product 63.489960
43572 little weak 64.329545
18938 delish but 64.895154
78237 truffle oil 65.825549
32361 great gift basket 67.056958
23050 excellent formula 67.515298
7033 beautiful plant 69.531110
32903 great quality product 69.783854
80139 very convenient 70.858109
28935 fun nostalgia 71.358455
49277 new review 72.898088
1719 amazingly good coffee 73.608544

86527 rows × 2 columns


In [33]:
print("And the coefficient of the Score variable: ")
ufn_feature_coefs[ufn_feature_coefs.feature == 'Score']


And the coefficient of the Score variable: 
Out[33]:
feature coef
86526 Score -1.807011
